We analyze 39 autoimmune disease- and trait-associated SNP sets, obtained from the Supplemental table 1 of the Farh, K. K.-H., Marson, A., Zhu, J., Kleinewietfeld, M., Housley, W. J., Beik, S., … Bernstein, B. E. (2014). “Genetic and epigenetic fine mapping of causal autoimmune disease variants”“ Nature. doi:10.1038/nature13835.
We test each disease- and trait-associated SNP set for enrichment in cell/tissue-specific histone modification marks (broadPeak) obtained from the Roadmap Epigenomics project. The results of this analysis are stored in the data/Roadmap_broadPeak2/ folder.
Set up the environment
suppressMessages(source("utils2.R")) # See the required packages there
suppressMessages(source("episimilarity.R"))
We first load annotation data containing disease name-category mappings. We have four categories of the diseases/traits - “immunological”, “neurological”, “hematological” and “metabolic”.
We load the matrix of p-values using the ‘load_gr_data’ function. The p-values are -log10-transformed, and a “-” sign is added to indicate depteted associations. This transformation will make distribution of the p-values centered around 0, with smaller/larger numbers indicating more signficant depletions/enrichments, respectively.
Note that we may also use the matrix of odds ratios. The ‘load_gr_data’ function will transform the odds ratios into log2 scale, so lack of effect will be 0, depteled effect will be negative and enriched effect will be positive.
library(xlsx)
# Load term mapping
term.mapping <- read.xlsx2("data/icd9_mapping.xlsx", sheetName = "manual")
# Load the data
mtx <- load_gr_data("data/Roadmap_broadPeak2/matrix_PVAL.txt")
mtx <- mtx[, match(term.mapping$BED, colnames(mtx))] # Match and subset the data
# A vector of category names, same order as the correlation matrix
categoryNames <- term.mapping$Category[match(colnames(mtx), term.mapping$Name)]
names(categoryNames) <- term.mapping$Name[match(colnames(mtx), term.mapping$Name)]
# Set side colors
ColSideColors <- as.numeric(factor(categoryNames[colnames(mtx)]))
cbind((categoryNames), ColSideColors) # Sanity check
ColSideColors[ColSideColors == 1] <- "red" # Immunological
ColSideColors[ColSideColors == 2] <- "blue" # Metabolic
ColSideColors[ColSideColors == 3] <- "green" # Neurological
ColSideColors[ColSideColors == 4] <- "yellow" # Trait
We first check how the distributions of the -log10-transformed p-values for each sample look like side-by-side.
ggplot(melt(mtx), aes(x = Var2, y = value, fill = Var2)) + geom_boxplot() + theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) + theme(legend.position = "none")
# coord_cartesian(ylim = c(-5, 5)) +
We use Pearson correlation coefficient to compare the profiles of the -log10-transformed p-values. It’s a good idea to scale and center the profiles (columns). Let’s check how the scaled data looks like.
mtx <- scale(mtx)
ggplot(melt(mtx), aes(x = Var2, y = value, fill = Var2)) + geom_boxplot() + theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) + theme(legend.position = "none")
# coord_cartesian(ylim = c(-5, 5)) +
Epigenomic similarity analysis groups SNP sets by correlation of their epigenomic enrichemt profiles (sets of -log10-transformed p-values) using Pearson’s correlation coefficient.
# rcorr returns a list, [[1]] - correl coeffs, [[3]] - p-values. Type - pearson/spearman
mtx.cor <- rcorr(as.matrix(mtx), type = "pearson")[[1]]
# Optionally, try Spearman or Kendall correlation mtx.cor[[1]] <- cor(as.matrix(mtx), method='kendall')
The NxN square matrix of correlation coefficients, where N is the count of SNP sets, is visualized as a clustered heatmap. The distance and clustering parameters can be tweaked. By default, “euclidean” distance and “ward.D2” similarity metrics are used. The ‘mtx.plot’ function is a wrapper for plotting the clustered heatmap of correlation coefficients.
h <- mtx.plot(mtx.cor, SideColors = ColSideColors)
We can also evaluate clustering stability using ‘pvclust’ package.
library(pvclust)
# library(parallel) cl <- makeCluster(4, type='PSOCK') result <- parPvclust(cl=cl, mtx, method.dist=dist.method, method.hclust=hclust.method, nboot=10000, iseed=1) stopCluster(cl) saveRDS(result, 'data/pvClust_10000.Rds')
result <- readRDS("data/pvClust_10000.Rds")
# dev.off()
plot(result)
pvrect(result, alpha = 0.9, type = "leq", max.only = TRUE, lwd = 3)
hclustergram <- result$hclust # Make the heatmap plot whatever hclust object
And, we can visualize this clustering using the ‘aheatmap’ function from the ‘NMF’ package.
suppressMessages(library(NMF))
annot <- data.frame(Category = term.mapping$Category) #, Size=as.numeric(term.mapping$BEDcount))
annotColor <- list(Category = c("red", "blue", "green", "yellow")) #, Size=c('red', 'white', 'blue'))
h <- aheatmap(mtx.cor, Rowv = as.dendrogram(hclustergram), Colv = as.dendrogram(hclustergram), color = matlab.like(50), distfun = dist.method, hclustfun = hclust.method, annCol = annot, annColors = annotColor, cexCol = 1, cexRow = 1)
The heatmap object contains information about the clustering. This information can be visualized as a dendrogram, cut into separate clusters defined by the cut height (user defined). The cluster ordering can be saved.
# Empirically define clusters
par(mfrow = c(1, 2))
plot(as.hclust(h$Colv), cex = 0.3) # Plot dendrogram
cl_num <- 3 # Empirically set desired numter of clusters
cols <- rainbow(cl_num) # Make colors for clusters
rect.hclust(as.hclust(h$Colv), k = cl_num, border = cols) # Define the clusters by rectangles
hcut <- 30 # Empirically set height to cut the tree
as.hclust(h$Colv)$height %>% density %>% plot
abline(v = hcut)
mtx.clust <- h$Colv %>% mtx.clusters(height = hcut, minmembers = 3)
Cluster01 has 16 members
Cluster02 has 10 members
Cluster03 has 13 members
par(mfrow = c(1, 1))
# Save the results of clustering
write.table(as.data.frame(mtx.clust), "results/clustering_all.txt", sep = "\t", row.names = FALSE, quote = FALSE)
‘mtx.clust’ object now contains the cluster groups and labels. By default, if the number of SNP sets in each cluster is less than 3, such cluster is not considered for the differential enrichment analysis.
‘mtx.degfs’ function takes a matrix of odds ratios and compares their distributions between the clusters. Thus, for each GF, odds ratios of FOIs in one cluster are compared with odds ratios of FOIs in another cluster using Welch t-test. The results are outputted into ‘results/degfs_LABEL.xlsx’.
# A matrix to use for differential epigenomic analysis
mtx <- load_gr_data("data/Roadmap_broadPeak2/matrix_OR.txt")
mtx.degfs(mtx[, mtx.clust$eset.labels], mtx.clust, label = "broadPeak2")
The cell type specific analysis evaluates cell/tissue types with epigenomic marks most frequently and most significantly enriched with a given SNP set. The results are outputted into an Excel file, with each worksheet containing results for a specific SNP set.
mtx.cellspecific(mtx, "results/cellspecific_broadPeak2.xlsx")